
Experimental GGUF-2-PTE Converter #13266

Open · wants to merge 2 commits into base: main

Conversation

dillondesilva
Contributor

Summary

This PR is not intended for merge. Instead, it demonstrates a potential method by which .gguf files can be converted to .pte by leveraging parts of the existing transformers ecosystem.

The key idea is the following (a sketch of the full flow is included below):

  1. A GGUF model is loaded via the transformers library into a suitable auto class.
  2. Sample tokens are generated using the model tokenizer and a dummy sentence.
  3. The loaded model is captured with torch.export.
  4. The exported program is lowered and serialized to the ExecuTorch .pte format.
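
Below is a minimal sketch of this four-step flow, assuming a SmolLM2 GGUF hosted on the Hugging Face Hub; the model_id, filename, and output path are illustrative values, and the exact torch.export arguments may need adjusting per model.

# Minimal sketch of the flow above (not the exact PR script); model_id, filename,
# and the output path are example values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from executorch.exir import to_edge

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct-GGUF"  # assumed Hub repo id
filename = "SmolLM2-135M-Instruct-Q8_0.gguf"           # GGUF file within that repo

# 1. Load the GGUF checkpoint via transformers (weights are dequantized on load).
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
model.eval()

# 2. Build sample inputs from a dummy sentence.
sample = tokenizer("The capital of France is", return_tensors="pt")

# 3. Capture the model with torch.export.
with torch.no_grad():
    exported = torch.export.export(
        model, args=(), kwargs={"input_ids": sample["input_ids"]}, strict=False
    )

# 4. Lower to an ExecuTorch program and serialize it as a .pte file.
et_program = to_edge(exported).to_executorch()
with open("smollm2_q8_0.pte", "wb") as f:
    f.write(et_program.buffer)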

Early Learnings & Limitations

The attached code generates a .pte model that has yet to be tested

The experiment in this PR converts SmolLM2-135M-Instruct-Q8_0.gguf into a .pte file. However, whether this model works as expected within the ExecuTorch runtime has not yet been confirmed. This may also require some conversion of the tokenizer (I'm not too sure, but I'd be interested to know; it's probably in the docs somewhere).

Torch export errors can be...scary

Conceptually, I expected this experiment would not be too difficult to run. However, some of the setup issues around getting a reliable torch.export for certain models proved quite challenging. This could just be due to my limited knowledge of torch.export, but I think it also offers insight into the perspective of a developer who just wants a smooth experience converting models to .pte files.

  • Smoothness of conversion depends largely on the model being converted: I tried converting a GGUF for LFM-2 using the same attached code (changing only model_id and filename), and some of the errors became quite intimidating: long traces rooted in various operators and potentially unsupported ops appeared in the logs. LFM-2 is quite a new model, so perhaps it was an unfair example, but it would still be interesting to find the limit of which models can comfortably be converted via this workflow.

cc @lucylq

…workflow is shown. Functional but prone to lots of random whacky torch export errors

pytorch-bot bot commented Aug 10, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13266

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 2ced5c5 with merge base c8a0706:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 10, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@dillondesilva dillondesilva changed the title Updated gguf2pte converter. Experimental code to scaffold and depict … Experimental GGUF-2-PTE Converter Aug 10, 2025

torch_dtype = torch.float32
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
Contributor

@dillondesilva what dtype are the weights after loading a GGUF model? Are they dequantized to FP32?

If so, I'm not sure this is really a converter in the sense that it doesn't preserve the quantization from GGUF.

But it is a good start, especially for getting the model structure. We just need to parse the GGUF weights and convert them to int_data/scales/zeros so we can reroute to a kernel. We did have a rudimentary converter for GGUF in torchchat that supported Q4_0 and Q6_K, but this is no longer a popular format.

We could probably start by trying to support Q4_K_M, which requires support for Q4_K and Q6_K. Here is a vibe-coded version of this for Q4_K (so no guarantee that it's correct, but it looks reasonable):

# pip install gguf numpy
import numpy as np
import gguf

# ---- helpers ----
def _fp16le_to_f32(buf_mv):
    return np.frombuffer(buf_mv, dtype="<f2", count=1).astype(np.float32)[0]

def _unpack_q4k_scale_min_codes(bytes12: memoryview):
    """Return two (8,) arrays of 6-bit integers for sub-block scales and mins."""
    b = np.frombuffer(bytes12, dtype=np.uint8)
    # Layout per llama.cpp ("Tensor Encoding Schemes" wiki / get_scale_min_k4),
    # where A-H are the 8 sub-block scales and a-h are the 8 sub-block mins:
    #  0: EEAAAAAA   1: FFBBBBBB   2: GGCCCCCC   3: HHDDDDDD
    #  4: eeaaaaaa   5: ffbbbbbb   6: ggcccccc   7: hhdddddd
    #  8: eeeeEEEE   9: ffffFFFF  10: ggggGGGG  11: hhhhHHHH
    S0_3 = b[0:4] & 0x3F                             # low 6 bits of bytes 0-3
    S4_7 = (b[8:12] & 0x0F) | ((b[0:4] >> 6) << 4)   # low nibble of bytes 8-11 + top 2 bits of bytes 0-3

    M0_3 = b[4:8] & 0x3F                             # low 6 bits of bytes 4-7
    M4_7 = (b[8:12] >> 4) | ((b[4:8] >> 6) << 4)     # high nibble of bytes 8-11 + top 2 bits of bytes 4-7

    S = np.concatenate([S0_3, S4_7]).astype(np.float32)  # (8,)
    M = np.concatenate([M0_3, M4_7]).astype(np.float32)  # (8,)
    return S, M

def extract_q4k(gguf_path: str, tensor_name: str):
    """
    Returns:
      q_codes  : (n_super, 256) uint8  -- 4-bit codes per superblock (values 0..15)
      scales   : (n_super, 8)  float32 -- per-subblock scale (real units)
      mins     : (n_super, 8)  float32 -- per-subblock min/offset (real units)
      d, dmin  : (n_super,)    float32 -- super-scales used to decode the 6-bit fields
    Notes:
      - Each superblock covers 256 weights = 8 sub-blocks * 32 each.
      - Reconstruct weights for sub-block j:  w = scales[i,j] * q - mins[i,j]
      - Zero-point (affine form): z = mins / scales  (can be fractional)
    """
    r = gguf.GGUFReader(gguf_path)
    # GGUFReader exposes tensors as a list of ReaderTensor objects; look one up by name.
    t = next(t for t in r.tensors if t.name == tensor_name)
    raw = memoryview(t.data)

    # Superblock layout (Q4_K):
    # [d fp16][dmin fp16][12B packed S/M codes][128B 4-bit codes]
    stride = 2 + 2 + 12 + 128  # 144 bytes
    n_super = len(raw) // stride
    assert len(raw) % stride == 0, "Unexpected Q4_K tensor byte length"

    d     = np.empty(n_super, dtype=np.float32)
    dmin  = np.empty(n_super, dtype=np.float32)
    S_all = np.empty((n_super, 8), dtype=np.float32)
    M_all = np.empty((n_super, 8), dtype=np.float32)
    Q_all = np.empty((n_super, 256), dtype=np.uint8)

    off = 0
    for i in range(n_super):
        # two fp16 super-scales
        d[i]    = _fp16le_to_f32(raw[off:off+2]); off += 2
        dmin[i] = _fp16le_to_f32(raw[off:off+2]); off += 2

        # packed 6-bit sub-scales / sub-mins
        s12 = raw[off:off+12]; off += 12
        S6, M6 = _unpack_q4k_scale_min_codes(s12)

        # realize to real units
        S_all[i, :] = d[i]    * S6
        M_all[i, :] = dmin[i] * M6

        # 128 bytes => 256 4-bit codes. Per llama.cpp, each 32-byte chunk stores
        # one 32-weight sub-block in its low nibbles and the next in its high nibbles.
        codes_b = np.frombuffer(raw[off:off+128], dtype=np.uint8).reshape(4, 32); off += 128
        q_low   = (codes_b & 0x0F).astype(np.uint8)  # sub-blocks 0, 2, 4, 6
        q_high  = (codes_b >> 4).astype(np.uint8)    # sub-blocks 1, 3, 5, 7
        Q_all[i] = np.stack([q_low, q_high], axis=1).reshape(256)

    return Q_all, S_all, M_all, d, dmin

# ---- Example usage ----
# q, s, m, d, dmin = extract_q4k("model.gguf", "model.layers.0.self_attn.q_proj.weight")
# # Dequantize one superblock 'i', sub-block j (32 weights):
# i, j = 0, 3
# w_block = s[i, j] * q[i, j*32:(j+1)*32].astype(np.float32) - m[i, j]
# # Optional affine form zero-point:
# z_block = m[i, j] / s[i, j]

Now we don't currently have any quantized kernels that will handle floating point zeros (in XNNPACK or elsewhere), but I could quickly put up a patch to support that for our lowbit kernels in a day or two.
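
For concreteness, here is a rough, untested sketch of how the extract_q4k output above could be repacked into that int_data/scales/zeros form (group size 32, fractional zero-points); the helper name and layout assumptions are illustrative only:

# Hypothetical repacking of extract_q4k() output into int_data / scales / zeros
# (affine form with fractional zero-points, group size 32). Assumes superblocks
# are laid out contiguously so that reshape(-1, 32) yields the per-group layout.
import numpy as np
import torch

def q4k_to_affine(q, s, m):
    """q: (n_super, 256) uint8; s, m: (n_super, 8) float32 -- from extract_q4k()."""
    int_data = torch.from_numpy(q.reshape(-1, 32))  # 4-bit codes, one row per 32-weight group
    scales   = torch.from_numpy(s.reshape(-1))      # per-group scale
    zeros    = torch.from_numpy(m.reshape(-1) / np.maximum(s.reshape(-1), 1e-12))  # fractional zero-point
    # Dequant check: w ~= (int_data.float() - zeros[:, None]) * scales[:, None]
    return int_data, scales, zeros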

Contributor

@lucylq lucylq Aug 11, 2025

Thanks for the example, the flow looks quite clean. Agree with @metascroy that we may need some custom weight conversion.

I was imagining we could export a PTE file without weights and plug in GGUF weights at runtime, but that also requires some more work on export/runtime before it's possible.

Contributor Author

Good catch about the weights being dequantized. I pushed a quick update, and it does seem that the GGUF weights are dequantized to FP32 (I also found this in the docs).
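
For reference, here's a quick sketch of the kind of check I mean (model_id and filename as in the attached script):

# Hypothetical dtype check after loading the GGUF through transformers.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
print({p.dtype for p in model.parameters()})  # expected: {torch.float32}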

As you've mentioned, it would be great to have some sort of a conversion module we route the model through once the GGUF has been loaded by HF.

What would be the best path forward for development? Do we want an RFC, or some abstractions in this PR that we can use to capture this process plus any additional steps (e.g. dtype conversion)?

[Screenshot attached: 2025-08-12, 8:31 pm]

@lucylq
Contributor

lucylq commented Aug 11, 2025

cc @swolchok on gguf-pte conversion

@swolchok
Contributor

swolchok commented Aug 11, 2025

Are the models in the transformers library guaranteed to be exportable? I was under the impression that we generally needed to curate exportable versions of LLMs at this stage in our development, but perhaps I am out of date.

Also CC @mergennachin

@dillondesilva
Contributor Author

Are the models in the transformers library guaranteed to be exportable? I was under the impression that we generally needed to curate exportable versions of LLMs at this stage in our development, but perhaps I am out of date.

Also CC @mergennachin

@swolchok Good point! Welp, I haven't done a detailed analysis on this, but I think it's largely dependent on the model architecture and the operations within it - feel free to have a play around with changing the model_id + filename in the attached script. Last I checked, LFM2-350M-GGUF did not have a smooth export experience.

Perhaps there are some commonalities between what exports well and what doesn't. Investigating this could help us understand the limits of what is and isn't exportable.
